There are 42,840 hierarchical time series covering 3 US states: California (CA), Texas (TX), and Wisconsin (WI). “Hierarchical” here means that the data can be aggregated at different levels: item level, department level, product category level, and state level. The sales information spans Jan 2011 to June 2016. In addition to the sales numbers, we are also given corresponding data on prices, promotions, and holidays.
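To make “hierarchical” concrete, the same bottom-level series can be rolled up with a simple group_by/summarise. A minimal sketch with hypothetical numbers (the IDs and sales values are made up for illustration):

```r
library(dplyr)

# toy bottom-level frame: one row per item-store series (hypothetical sales)
bottom <- tibble(
  item_id  = c("FOODS_1_001", "FOODS_1_001", "HOBBIES_1_001"),
  state_id = c("CA", "TX", "CA"),
  sales    = c(12, 7, 3)
)

# roll the bottom level up to the state level
state_level <- bottom %>%
  group_by(state_id) %>%
  summarise(sales = sum(sales)) %>%
  ungroup()
state_level
```

The same pattern with `dept_id` or `cat_id` in the `group_by` produces the department- and category-level series used throughout this analysis.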
The training data comes in the shape of 3 separate files:
sales_train_validation.csv: This is the main training data. It has 1 column for each of the 1941 days from 2011-01-29 to 2016-05-22, not including the validation period of 28 days until 2016-06-19. It also includes the IDs for item, department, category, store, and state. There are 30490 rows, one for each combination of the 3049 items and 10 stores.
sell_prices.csv: The store and item IDs together with the sales price of the item as a weekly average.
calendar.csv: Dates together with related features like day-of-the-week, month, year, and 3 binary flags indicating whether the stores in each state allowed purchases with SNAP food stamps on that date (1) or not (0). SNAP: The Supplemental Nutrition Assistance Program (SNAP) is the largest federal nutrition assistance program. SNAP provides benefits to eligible low-income individuals and families via an Electronic Benefits Transfer card, which can be used like a debit card to purchase eligible food in authorized retail food stores.
path <- 'Data/'
train <- vroom(str_c(path,'sales_train_validation.csv'), delim = ",", col_types = cols())
prices <- vroom(str_c(path,'sell_prices.csv'), delim = ",", col_types = cols())
calendar <- read_csv(str_c(path,'calendar.csv'), col_types = cols())
(head(train))
## # A tibble: 6 x 1,919
## id item_id dept_id cat_id store_id state_id d_1 d_2 d_3 d_4
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 HOBB~ HOBBIE~ HOBBIE~ HOBBI~ CA_1 CA 0 0 0 0
## 2 HOBB~ HOBBIE~ HOBBIE~ HOBBI~ CA_1 CA 0 0 0 0
## 3 HOBB~ HOBBIE~ HOBBIE~ HOBBI~ CA_1 CA 0 0 0 0
## 4 HOBB~ HOBBIE~ HOBBIE~ HOBBI~ CA_1 CA 0 0 0 0
## 5 HOBB~ HOBBIE~ HOBBIE~ HOBBI~ CA_1 CA 0 0 0 0
## 6 HOBB~ HOBBIE~ HOBBIE~ HOBBI~ CA_1 CA 0 0 0 0
## # ... with 1,909 more variables: d_5 <dbl>, d_6 <dbl>, ...
(head(prices))
## # A tibble: 6 x 4
## store_id item_id wm_yr_wk sell_price
## <chr> <chr> <dbl> <dbl>
## 1 CA_1 HOBBIES_1_001 11325 9.58
## 2 CA_1 HOBBIES_1_001 11326 9.58
## 3 CA_1 HOBBIES_1_001 11327 8.26
## 4 CA_1 HOBBIES_1_001 11328 8.26
## 5 CA_1 HOBBIES_1_001 11329 8.26
## 6 CA_1 HOBBIES_1_001 11330 8.26
(head(calendar))
## # A tibble: 6 x 14
## date wm_yr_wk weekday wday month year d event_name_1
## <date> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 2011-01-29 11101 Saturd~ 1 1 2011 d_1 <NA>
## 2 2011-01-30 11101 Sunday 2 1 2011 d_2 <NA>
## 3 2011-01-31 11101 Monday 3 1 2011 d_3 <NA>
## 4 2011-02-01 11101 Tuesday 4 2 2011 d_4 <NA>
## 5 2011-02-02 11101 Wednes~ 5 2 2011 d_5 <NA>
## 6 2011-02-03 11101 Thursd~ 6 2 2011 d_6 <NA>
## # ... with 6 more variables: event_type_1 <chr>, event_name_2 <chr>,
## # event_type_2 <chr>, snap_CA <dbl>, snap_TX <dbl>, snap_WI <dbl>
# (summary(train))
(summary(prices))
## store_id item_id wm_yr_wk sell_price
## Length:6841121 Length:6841121 Min. :11101 Min. : 0.010
## Class :character Class :character 1st Qu.:11247 1st Qu.: 2.180
## Mode :character Mode :character Median :11411 Median : 3.470
## Mean :11383 Mean : 4.411
## 3rd Qu.:11517 3rd Qu.: 5.840
## Max. :11621 Max. :107.320
(summary(calendar))
## date wm_yr_wk weekday wday
## Min. :2011-01-29 Min. :11101 Length:1969 Min. :1.000
## 1st Qu.:2012-06-04 1st Qu.:11219 Class :character 1st Qu.:2.000
## Median :2013-10-09 Median :11337 Mode :character Median :4.000
## Mean :2013-10-09 Mean :11347 Mean :3.997
## 3rd Qu.:2015-02-13 3rd Qu.:11502 3rd Qu.:6.000
## Max. :2016-06-19 Max. :11621 Max. :7.000
## month year d event_name_1
## Min. : 1.000 Min. :2011 Length:1969 Length:1969
## 1st Qu.: 3.000 1st Qu.:2012 Class :character Class :character
## Median : 6.000 Median :2013 Mode :character Mode :character
## Mean : 6.326 Mean :2013
## 3rd Qu.: 9.000 3rd Qu.:2015
## Max. :12.000 Max. :2016
## event_type_1 event_name_2 event_type_2 snap_CA
## Length:1969 Length:1969 Length:1969 Min. :0.0000
## Class :character Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Mode :character Median :0.0000
## Mean :0.3301
## 3rd Qu.:1.0000
## Max. :1.0000
## snap_TX snap_WI
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.3301 Mean :0.3301
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
extract_ts <- function(df){
  # reshape the wide day columns (d_1, d_2, ...) into long (dates, sales) pairs
  min_date <- date("2011-01-29")
  df %>%
    select(id, starts_with("d_")) %>%
    pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>%
    mutate(dates = as.integer(str_remove(dates, "d_"))) %>%  # "d_7" -> 7
    mutate(dates = min_date + dates - 1) %>%                 # day index -> calendar date
    mutate(id = str_remove(id, "_validation"))               # strip suffix for clean IDs
}
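On a tiny hand-made tibble (hypothetical id and sales values), the reshaping steps inside extract_ts work like this:

```r
library(dplyr)
library(tidyr)
library(stringr)

min_date <- as.Date("2011-01-29")

# toy wide-format frame: one id, two day columns (hypothetical values)
toy <- tibble(id = "HOBBIES_1_001_CA_1_validation", d_1 = 3, d_2 = 0)

long <- toy %>%
  select(id, starts_with("d_")) %>%
  pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>%
  mutate(dates = min_date + as.integer(str_remove(dates, "d_")) - 1) %>%
  mutate(id = str_remove(id, "_validation"))
long
```

Each wide row becomes one long row per day, with d_1 mapped to 2011-01-29, d_2 to 2011-01-30, and so on.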
This is the total daily sales across CA, TX, and WI.
Walmart sales are generally rising. We can make out some yearly seasonality, and a dip at Christmas, which is the only day of the year when the stores are closed.
A deeper look reveals strong weekly seasonality, plus possibly some additional overlapping patterns with periods shorter than a year.
The most recent 2016 sales numbers appear to grow a bit faster than in previous years.
all.sales <- train %>%
summarise_at(vars(starts_with("d_")), sum) %>%
mutate(id = 1)
all.sales.ts <- extract_ts(all.sales)
gg <- all.sales.ts %>%
ggplot(aes(dates, sales)) +
geom_line(col = "lightblue") +
theme_tufte() +
labs(x = "Date", y = "Sales", title = "Sales from 2011-01-29 to 2016-05-22 (1941 Days)") +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(gg, dynamicTicks = TRUE)
These are the sales per state at a monthly aggregate level.
CA sells more items in general, while WI was slowly catching up to TX and eventually surpassed it in the last months of our training data.
CA has pronounced dips in 2013 and 2015 that appear to be present in the other states as well, just less severe. These dips and peaks don’t appear to always occur (see 2012) but they might primarily reflect the yearly seasonality.
sale.state <- train %>%
group_by(state_id) %>%
summarise_at(vars(starts_with("d_")), sum) %>%
rename(id = state_id)
sale.state.ts <- extract_ts(sale.state) %>%
mutate(month = month(dates),
year = year(dates)) %>%
group_by(month, year, id) %>%
summarise(sales = sum(sales),
dates = min(dates)) %>%
ungroup() %>%
filter(str_detect(as.character(dates), "..-..-01")) %>%
filter(dates != max(dates))
gg <- sale.state.ts %>%
ggplot(aes(dates, sales, col = id)) +
geom_line() +
theme_tufte() +
labs(x = "Date", y = "Sales", title = "Monthly Sales of State CA, TX, WI") +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(gg, dynamicTicks = TRUE)
These are the sales per department at a monthly aggregate level.
sale.dept <- train %>%
group_by(dept_id) %>%
summarise_at(vars(starts_with("d_")), sum) %>%
rename(id = dept_id)
sale.dept.ts <- extract_ts(sale.dept) %>%
mutate(month = month(dates),
year = year(dates)) %>%
group_by(month, year, id) %>%
summarise(sales = sum(sales),
dates = min(dates)) %>%
ungroup() %>%
filter(str_detect(as.character(dates), "..-..-01")) %>%
filter(dates != max(dates))
gg <- sale.dept.ts %>%
ggplot(aes(dates, sales, col = id)) +
geom_line() +
theme_tufte() +
labs(x = "Date", y = "Sales", title = "Monthly Sales of Departments") +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(gg, dynamicTicks = TRUE)
There are 10 stores, 4 in CA and 3 each in TX and WI, and 3 categories: Foods, Hobbies, and Household. This visual uses a faceted view for the stores (1 facet per state), including sales across all categories:
The most common category is “Foods”, followed by “Household” and then “Hobbies”. The number of “Household” rows is closer to the number of “Foods” rows than the corresponding sales figures are, indicating that more “Foods” units are sold than “Household” ones.
In terms of stores, the TX stores are quite close together in sales, with “TX_3” rising from the level of “TX_1” to the level of “TX_2” over time. The WI stores “WI_1” and “WI_2” show a curious jump in sales in 2012, while “WI_3” shows a long dip over several years.
The CA stores differ more from one another than those in TX and WI. It is quite noticeable that “CA_2” declines to the “CA_4” level in 2015, then recovers and jumps up to “CA_1” sales levels later in the year.
sales.cat <- train %>%
group_by(cat_id) %>%
summarise_at(vars(starts_with("d_")), sum) %>%
rename(id = cat_id)
sales.store <- train %>%
group_by(store_id) %>%
summarise_at(vars(starts_with("d_")), sum) %>%
rename(id = store_id)
p1 <- extract_ts(sales.cat) %>%
mutate(month = month(dates),
year = year(dates)) %>%
group_by(month, year, id) %>%
summarise(sales = sum(sales),
dates = min(dates)) %>%
ungroup() %>%
filter(str_detect(as.character(dates), "..-..-01")) %>%
filter(dates != max(dates)) %>%
ggplot(aes(dates, sales, col = id)) +
geom_line() +
theme_hc() +
theme(legend.position = "none") +
labs(title = "Sales per Category", x = "Date", y = "Sales")+
theme(plot.title = element_text(hjust = 0.5))
p2 <- train %>%
count(cat_id) %>%
ggplot(aes(cat_id, n, fill = cat_id)) +
geom_col() +
theme_hc() +
theme(legend.position = "none") +
theme(axis.text.x = element_text(size = 7)) +
labs(x = "", y = "", title = "Rows per Category")+
theme(plot.title = element_text(hjust = 0.5))
p3 <- extract_ts(sales.store) %>%
mutate(month = month(dates),
year = year(dates)) %>%
group_by(month, year, id) %>%
summarise(sales = sum(sales),
dates = min(dates)) %>%
ungroup() %>%
filter(str_detect(as.character(dates), "..-..-01")) %>%
filter(dates != max(dates)) %>%
mutate(state_id = str_sub(id, 1, 2)) %>%
ggplot(aes(dates, sales, col = id)) +
geom_line() +
theme_hc() +
theme(legend.position = "bottom") +
labs(title = "Sales per Store", x = "Date", y = "Sales", col = "Store ID") +
facet_wrap(~state_id)+
theme(plot.title = element_text(hjust = 0.5))
layout <- "
AAB
CCC
"
p1 + p2 + p3 + plot_layout(design = layout)
Our data has 7 departments: 3 for “FOODS” and 2 each for “HOBBIES” and “HOUSEHOLD”. Together with the 3 states, that makes 21 state-department combinations. This visual shows monthly sales per department:
“FOODS_3” is clearly driving the majority of “FOODS” category sales in all states. “FOODS_2” is picking up a bit towards the end of the time range, especially in “WI”.
Similarly, “HOUSEHOLD_1” is clearly outselling “HOUSEHOLD_2”. “HOBBIES_1” is on a higher average sales level than “HOBBIES_2”, but both are not showing much development over time.
min_date <- date("2011-01-29")
state.dept <- train %>%
group_by(dept_id, state_id) %>%
summarise_at(vars(starts_with("d_")), sum) %>%
ungroup() %>%
select(ends_with("id"), starts_with("d_")) %>%
pivot_longer(starts_with("d_"), names_to = "dates", values_to = "sales") %>%
mutate(dates = as.integer(str_remove(dates, "d_"))) %>%
mutate(dates = min_date + dates - 1)
state.dept %>%
mutate(month = month(dates),
year = year(dates)) %>%
group_by(month, year, dept_id, state_id) %>%
summarise(sales = sum(sales),
dates = min(dates)) %>%
ungroup() %>%
filter(str_detect(as.character(dates), "..-..-01")) %>%
filter(dates != max(dates)) %>%
ggplot(aes(dates, sales, col = dept_id)) +
geom_line() +
facet_grid(state_id ~ dept_id) +
theme_tufte() +
theme(legend.position = "none", strip.text.x = element_text(size = 8)) +
labs(title = "Sales per Department and State", x = "Date", y = "Sales") +
theme(plot.title = element_text(hjust = 0.5))
Because of the generally increasing trend in sales, I smooth the trend with a LOESS fit and subtract it from the data, so the heatmap shows relative changes. The Christmas dip is removed because it would be distracting for the purpose of this plot. Here is a heat map that combines the weekly and yearly seasonalities:
The weekly pattern is strong, with Sat and Sun standing out prominently. Monday and Friday also seem to benefit a bit from the weekend effect.
The months of Nov and Dec show clear dips, while the summer months May, Jun, and Jul suggest a milder secondary dip. Certain holidays, like the 4th of July, might somewhat influence these patterns; but over 5 years they should average out reasonably well.
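The detrending step can be sketched on synthetic data (an assumed linear trend plus weekly seasonality, not the real series): fit a slow LOESS curve, then subtract it so only the seasonal signal remains.

```r
set.seed(42)

# synthetic daily series: linear trend + weekly seasonality + noise
t     <- 1:365
sales <- 100 + 0.2 * t + 10 * sin(2 * pi * t / 7) + rnorm(365, sd = 2)

# smooth trend with a local-linear LOESS fit (base R stats::loess)
trend     <- predict(loess(sales ~ t, span = 1/2, degree = 1))
sales_rel <- sales - trend

# the detrended series is centred near zero, leaving the weekly pattern
mean(sales_rel)
```

With the trend removed, summing `sales_rel` by weekday and month gives the relative-sales cells plotted in the heatmap.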
days <- train %>%
summarise_at(vars(starts_with("d_")), sum) %>%
mutate(id = 1)
days.count <- extract_ts(days) %>%
filter(!str_detect(as.character(dates), "-12-25")) #removed christmas dip
loess_all <- predict(loess(days.count$sales ~ as.integer(days.count$dates - min(days.count$dates)) + 1, span = 1/2, degree = 1))
days.count <- days.count %>%
mutate(loess = loess_all) %>%
mutate(sales_rel = sales - loess)
p1 <- days.count %>%
mutate(wday = wday(dates, label = TRUE, week_start = 1),
month = month(dates, label = TRUE),
year = year(dates)) %>%
group_by(wday, month, year) %>%
summarise(sales = sum(sales_rel)/1e3) %>%
ggplot(aes(month, wday, fill = sales)) +
geom_tile() +
labs(x = "Month of the year", y = "Day of the week", fill = "Relative Sales [1k]") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_distiller(palette = "Spectral") +
theme_hc()
p1
This visual represents the weekly and monthly seasonality patterns.
After scaling, the weekday vs weekend pattern is very similar for all 3 states, except for an interesting downturn in Sunday sales in WI.
The monthly seasonalities are indeed complex. There is a dip in the winter months and a second, generally shallower dip around May. WI is again the odd state out: it sells notably less in the summer compared to TX and especially CA; so much so that the Feb/Aug ratio is inverted for WI vs CA/TX.
states <- train %>%
group_by(state_id) %>%
summarise_at(vars(starts_with("d_")), sum) %>%
rename(id = state_id)
states.mean.sales <- extract_ts(states) %>%
filter(!str_detect(as.character(dates), "-12-25")) %>%
group_by(id) %>%
mutate(loess = predict(loess(sales ~ as.integer(dates - min(dates)) + 1, span = 1/2, degree = 1)),
mean_sales = mean(sales)) %>%
mutate(sales_rel = (sales - loess)/mean_sales)
p1 <- states.mean.sales %>%
ungroup() %>%
mutate(wday = wday(dates, label = TRUE, week_start = 1)) %>%
group_by(id, wday) %>%
summarise(sales = sum(sales_rel)) %>%
ungroup() %>%
ggplot(aes(wday, sales, group = id, col = id)) +
geom_line(size = 1.5) +
theme_tufte() +
theme(legend.position = "top") +
labs(x = "", y = "Relative Sales", title = "Weekly Seasonality")+
theme(plot.title = element_text(hjust = 0.5))
p2 <- states.mean.sales %>%
ungroup() %>%
mutate(month = month(dates, label = TRUE)) %>%
group_by(id, month) %>%
summarise(sales = sum(sales_rel)) %>%
ungroup() %>%
ggplot(aes(month, sales, group = id, col = id)) +
geom_line(size = 1.5) +
theme_tufte() +
theme(legend.position = "top") +
labs(x = "", y = "Relative Sales", title = "Monthly Seasonality")+
theme(plot.title = element_text(hjust = 0.5))
layout <- "
AABBB
"
p1 + p2 + plot_layout(design = layout)
The calendar dataframe contains basic features like day-of-week, month, year, and of course date. Alongside the date there is also a d column which links each date to the column names in the training data. The other features deal with events and food stamps: the columns event_name_2 and event_type_2 contain only 5 non-NA values, so we ignore them here and focus on the event_*_1 features.
About 8% of the days in the calendar have a special event. Of these events, about 1/3 are Religious and 1/3 are National Holidays. The remaining third is again split into 2/3 Cultural (e.g. Valentine's Day) and 1/3 Sporting events (e.g. the Super Bowl).
Looking at the percentage of days where purchases with SNAP food stamps are allowed in Walmart stores, we find that it is the exact same for each of the 3 states: 650 days or 33%. This is noteworthy.
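Those SNAP shares are easy to sanity-check on a miniature stand-in calendar; a sketch with hypothetical flag values (on the real data each state comes out to 650 of 1969 days, about 33%):

```r
library(dplyr)

# miniature stand-in for the real calendar (hypothetical SNAP flags)
cal_toy <- tibble(
  date    = as.Date("2011-01-29") + 0:3,
  snap_CA = c(0, 1, 1, 0),
  snap_TX = c(0, 1, 0, 1),
  snap_WI = c(1, 0, 1, 0)
)

# share of SNAP days per state: sum the 0/1 flags, divide by row count
snap_share <- cal_toy %>%
  summarise_at(vars(starts_with("snap")), ~ sum(.x) / n())
snap_share
```

Running the same `summarise_at` on the real `calendar` reproduces the 33% figure for each of snap_CA, snap_TX, and snap_WI.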
p1 <- calendar %>%
filter(!is.na(event_type_1)) %>%
select(date, event_type_1) %>%
count(event_type_1) %>%
add_tally(n, name = "total") %>%
mutate(perc = n/total) %>%
ggplot(aes(reorder(event_type_1, n, FUN = min), perc, fill = event_type_1)) +
geom_col() +
scale_y_continuous(labels = scales::percent) +
coord_flip() +
theme_hc() +
theme(legend.position = "none") +
labs(x = "", y = "", title = "Share of Event Types") +
theme(plot.title = element_text(hjust = 0.5))
p2 <- calendar %>%
select(date, starts_with("snap")) %>%
pivot_longer(starts_with("snap"), names_to = "state", values_to = "snap") %>%
mutate(state = str_sub(state, 6,7)) %>%
group_by(state, snap) %>%
summarise(n = n()) %>%
add_tally(n, name = "total") %>%
mutate(perc = n/total) %>%
mutate(snap = as.logical(snap)) %>%
ggplot(aes(snap, perc, fill = snap)) +
geom_col() +
scale_y_continuous(labels = scales::percent) +
theme_hc() +
facet_wrap(~ state, scales = "free") +
theme(legend.position = "none") +
labs(x = "", y = "", title = "Days with SNAP purchases: same percentage for all states") +
theme(plot.title = element_text(hjust = 0.5))
layout <- "
AAAAA
CCCCC
"
p1 + p2 + plot_layout(design = layout)
There is item price information for each item ID (which encodes the category and department IDs) and its store ID (which encodes the state ID). Here is a facet grid with overlapping density plots of the price distributions within the groups of category, department, and state. Note the logarithmic scale on the x-axes:
The distributions are almost identical across the 3 states. There are some minute differences in the “FOODS” category, but these might be due to the smoothing bandwidth size.
There are notable differences between the categories: FOODs are on average cheaper than HOUSEHOLD items. And HOBBIES items span a wider range of prices than the other two; even suggesting a second peak at lower prices.
Among the three FOODS departments, FOODS_3 does not contain a high-price tail.
The HOBBIES category is the most diverse one, with both departments having quite broad distributions but “HOBBIES_1” accounting for almost all of the items above $10. “HOBBIES_2” has a bimodal structure.
The HOUSEHOLD price distributions are quite similar, but “HOUSEHOLD_2” peaks at clearly higher prices than “HOUSEHOLD_1”.
item.prices <- prices %>%
mutate(cat_id = str_sub(item_id, 1, -7)) %>%
mutate(dept_id = str_sub(item_id, -5, -5)) %>%
mutate(state = str_sub(store_id, 1, 2))
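The str_sub calls above slice the composite IDs apart; on a sample item and store ID (as seen in the head() output earlier) they give:

```r
library(stringr)

item_id  <- "HOBBIES_1_001"
store_id <- "CA_1"

# negative indices count from the end of the string
cat_part   <- str_sub(item_id, 1, -7)    # drop the trailing "_1_001" -> category
dept_part  <- str_sub(item_id, -5, -5)   # department number within the category
state_part <- str_sub(store_id, 1, 2)    # state abbreviation
c(cat_part, dept_part, state_part)
```

This gives "HOBBIES", "1", and "CA" for the sample IDs; the same offsets work for FOODS and HOUSEHOLD IDs because the "_d_nnn" tail has fixed width.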
item.prices %>%
ggplot(aes(sell_price, fill = dept_id)) +
geom_density(bw = 0.1, alpha = 0.5) +
scale_x_log10(breaks = c(0.5, 1, 5, 10, 50)) +
coord_cartesian(xlim = c(0.3, 60)) +
# facet_wrap(~ cat_id, nrow = 3) +
facet_grid(cat_id ~ state) +
theme_hc() +
theme(legend.position = "bottom") +
labs(x = "Average Sales Price [$]", y = "", fill = "Department",
title = "Item Prices vary by Category and Department",
subtitle = "But distributions are almost identical from State to State")+
theme(plot.title = element_text(hjust = 0.5))+
theme(plot.subtitle = element_text(hjust = 0.5))
These facet plots show item prices per category and department over the years.
Overall, the price distributions are pretty stable over the years, with only slight increases that are likely due to inflation.
An interesting evolution is visible in HOBBIES_2, which becomes much more bimodal over time: the second peak at $1 grows in importance until it almost reaches the level of the main peak just above $2. At the same time, the small secondary peak at half a dollar in HOBBIES_1 flattens after 2012.
The HOUSEHOLD departments are stable. FOODS shows small changes like the relative growth of the $1 peak of FOODS_1.
item.prices <- prices %>%
mutate(cat_id = str_sub(item_id, 1, -7)) %>%
mutate(dept_id = str_sub(item_id, -5, -5)) %>%
select(cat_id, dept_id, wm_yr_wk, sell_price) %>%
left_join(calendar %>%
select(date, wm_yr_wk) %>%
group_by(wm_yr_wk) %>%
slice(1), by = "wm_yr_wk") %>%
mutate(year = year(date),
month = month(date, label = TRUE, abbr = TRUE)) %>%
mutate(year_mon = str_sub(as.character(date), 1, 7)) %>%
ungroup()
item.prices %>%
sample_frac(0.3) %>%
ggplot(aes(x = sell_price, y = as.factor(year), fill = dept_id)) +
geom_density_ridges(bandwidth = 0.1, alpha = 0.5) +
scale_x_log10(breaks = c(0.5, 1, 2, 5, 10, 25)) +
coord_cartesian(xlim = c(0.4, 30)) +
facet_wrap(~ cat_id, nrow = 1) +
#theme_hc() +
theme(legend.position = "bottom") +
labs(x = "Average Sales Price [$]", y = "", fill = "Department",
title = "Item Prices by Category & Department - stable over the years",
subtitle = "But see HOBBIES_2 becoming bimodal + average price increases") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(plot.subtitle = element_text(hjust = 0.5))